════════════════════════════════════════
4. FIRST STEPS IN DATA ANALYSIS
════════════════════════════════════════
══════════════════════
4.1 Objectives
══════════════════════
Let's imagine that you have been assigned an indexing
project. You are to prepare an indexed version of a large
database, ready for other people to search in it for information.
How shall you proceed? In this topic, we deal with first steps in
hands-on analysis of machine-readable databases.
You may have to start by fending off shortcuts. Some
managers and clients want work completed yesterday. The first
question you are asked is: "How long is it going to take?" "The
data is the same as last time; we want only a few minor changes.
This won't take very long, will it?" Refuse any time estimate
until your analysis is complete. The ultimate objective is a
satisfied user. Shortcuts at the analysis and specification stage
rarely (make that "never") serve that objective. The best strategy
is to write time estimates once the project is fully specified in
writing and the data has been analyzed.
Analysis logically comes before data preparation; we
can't prepare something for indexing until we know what we have.
If we are dealing with familiar data, we still have to watch for
data quality problems... inconsistencies or unexpected patterns
that are common within large accumulations of data.
The analysis stage provides information that is needed
to prepare the data. After analyzing the data, we should have
enough information so that we will be able to:
» extract searchable content, that is, separate out
displayable text with punctuation and each search term
(word, phrase, numeric value, etc.) intact, at the same
time screening out all material that is not part of the
intended display;
» recognize record separations, that is, identify
divisions between units in cases where data is
logically divided into units which are meaningful to
the searcher (a property in a real estate file, a
person in a personnel file, a subsection in a manual of
regulations, a heading in a book or magazine article,
etc.);
» recognize field separations, that is, identify
divisions between elements of data that take on
different meaning according to where they occur within
a record. Examples of fields... purchase order number,
street address, city, postal code, quantity, item
description, cost per unit, etc.
» recognize formatting aids, that is, identify bytes
within the data that are intended to control display of
data... indents from the margin, shifts to different
fonts (e.g., italics), table spacing, etc.
This may strike you as simply finding out what is
needed to reduce data to formatted printable text. In a sense, it
is. But there are some pitfalls along the way. The trick is to be
able to recognize the pitfalls early in the indexing process while
the costs are still low. It's quite discouraging to index a
billion bytes and then find the index list contains large numbers
of false or meaningless search terms.
══════════════════════════════════════════════
4.2 Learn how the data was accumulated
══════════════════════════════════════════════
Ask questions. The data came from somewhere. How was
the data put together? Specifically:
» Was the same method used consistently to assemble all
the data? Or were there changes along the way? (If
methods or programs changed, then each subset of data
must be separately analyzed.) If possible, find out
why the data-gathering method changed. Was the change
prompted by a quality control problem, or the
introduction of new technology? In either case, check
the first part of the new data carefully for errors.
Starting up in unfamiliar surroundings can lead to
extra errors for a while. If the change was made on
account of quality control problems, re-check the
portion of data created just before the change.
» Was the data scanned from print media? Which scanning
technology and software were used? What was the
release date of that software? What was the quality of
the print media? Was the layout regular? How
thoroughly has the result been checked for reliability?
» If keypunching was used as a low cost substitute for
scanning printed material, were the keypunchers and
their supervisors familiar with the subject matter?
Were they working in their own language? What methods
of verification were used? (For example, keying by a
second operator is not always reliable.) How much
pressure were the keyers under? Were they being
monitored and paid by the kilo-keystroke as has often
been the case? People under pressure tend to make more
errors than those working "at their own pace".
» Has the data been used and updated in ways that would
sift out errors?
═══════════════════════════════════════════
4.3 Learn how the data will be used
═══════════════════════════════════════════
Put yourself in the position of a person searching
within this database. Don't settle for educated guesses; find out
for whom the data is being indexed and ask questions: What are the
needs of prospective searchers? How do they look for information
now: manually, using a mainframe computer, not at all? In what
ways can the indexing setup add to the value of the data for their
needs? (For example, a computerized phone book can be organized
one way to create mailing lists, in quite another to optimize the
speed of looking up individual names.) What kinds of search term
combinations are typically used? Perhaps some forms of search
were deemed impractical in the past. If searchers were invited to
imagine away all restrictions, what would they like to be able to
do?
Marketing questions are often outside the interests of
technical people. That's unfortunate, since all sorts of technical
side trips to nowhere might be avoided by interviewing even a few
potential end users.
═══════════════════════════════════════════
4.4 Access to samples and hard copy
═══════════════════════════════════════════
After acquiring technical background on the data and
marketing information on potential users, the key issues are:
» access to the data on media that can be handled by your
computer. Non-standard media still exist. ("I just
received the diskettes from the first volume on 8 inch
disks"... letter in February 1992). And there are
standard media that may not be useful to you... nine
track 6250 BPI tape doesn't feed the average personal
computer; 1.2 megabyte floppies aren't worth much on
a 360K floppy drive.
» ability to extract samples that are representative.
This depends in part on the degree of consistency of
the data. Pay particular attention to the first
several thousand bytes, and the very last several
thousand bytes. If the data set is large, extract
randomly selected portions in between. If there are
known changes in how the data was accumulated, put
together samples from each variation.
» access to hard copy. If at all possible, get paper
printouts that match the samples. The paper version is
helpful for identifying format and typesetting codes.
The way the hard copy is laid out also speaks volumes
about the search philosophy used to date. Ask users
how well the printed layout served their needs.
════════════════════════════════════
4.5 Access to software tools
════════════════════════════════════
A variety of programs (software tools) are supplied
with each tutorial, in source code and in executable form. Source
code has the extension ".C" and the DOS executable normally has the
extension ".EXE". There is also the occasional batch file, a
series of commands in a simple text file that can be interpreted by
DOS. If you have not already done so, include the directory that
has the executable version in the series of PATH names in the
AUTOEXEC.BAT file on your computer. Each program has a
description, or help screen, which is displayed when you input the
command (the program name without the extension) followed by /U
(for Usage) or ? (for help). Normally we will include the help
screen in the tutorial the first time a program is mentioned.
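For example, if the executables were copied into a directory
named C:\MIR (a name used here purely for illustration), the PATH
statement in AUTOEXEC.BAT might read:
PATH=C:\DOS;C:\MIR
Any directories already named in the PATH statement stay; the MIR
directory is simply appended to the list.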
════════════════════════════════════════════════
4.6 Extracting samples from larger files
════════════════════════════════════════════════
The program CPB (copy bytes) is useful for extracting
samples from larger files. Here is its help screen description:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage cpb byte_count start_byte input_file output_file
Copy bytes... Copy any portion of any file to a new file.
Start at a specified byte, copy a specified byte count.
Standard output may not be used because DOS drops carriage
returns and CTL-Z.
input: Any file whatsoever.
output: Portion of the input file.
writeup: MIR TUTORIAL ONE, topic 4
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Suppose you wish to separate out the first 10,000 bytes
of a file named BIG_DATA into a file called SAMPLE.000. The
command would be:
CPB 10000 0 BIG_DATA SAMPLE.000
To isolate 7,500 bytes from the middle of a 200,000 byte file, the
command takes the form:
CPB 7500 100000 BIG_ONE MY_SAMPL
CPB (copy bytes) is particularly useful for massive files. Here's
how to call out a sample between 4,000 and 5,000 bytes in length
from the end of a file that is 248,885,237 bytes long:
CPB 5000 248881000 BIGGER.YET BIGGER.END
The file BIGGER.END in this case would be 4,237 bytes long. CPB
simply stops copying when it reaches the end of the input file.
Sampling is one of many uses of CPB. We will use it in
a variety of ways further along.
If you wish to make a UNIX version of CPB, consider
removing the input and output file names and using standard input
and output instead. That is safe since UNIX does not quietly mess
up on carriage returns and CTL-Z characters. The advantage of
standard input and output is the convenience of piping from one
program to another without the clutter of intervening files.
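To make the suggestion concrete, here is a minimal sketch in C of
such a UNIX-style filter. The name CPBF and its exact structure
are illustrative only, not the MIR source code; the program skips
to the start byte on standard input, then copies the requested
count to standard output:
/* cpbf.c - illustrative sketch of a UNIX-style CPB that uses
   standard input and output instead of named files. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    long count, start, i;
    int c;

    if (argc != 3) {
        fprintf(stderr, "usage: cpbf byte_count start_byte < in > out\n");
        return 1;
    }
    count = atol(argv[1]);
    start = atol(argv[2]);

    /* Skip the bytes in front of the sample. */
    for (i = 0; i < start; i++)
        if (getchar() == EOF)
            return 0;

    /* Copy the requested number of bytes; like CPB, stop quietly
       at end of file. */
    for (i = 0; i < count && (c = getchar()) != EOF; i++)
        putchar(c);

    return 0;
}
Compiled under UNIX, the middle-of-file example above becomes:
cpbf 7500 100000 < BIG_ONE > MY_SAMPL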
═══════════════════════════════════════════
4.7 Byte surveys - a worked example
═══════════════════════════════════════════
We start the analysis with a byte survey of each
sample, or of one file that combines all your data samples. If you
choose to combine samples, use the DOS COPY command to join them
together, being very sure to use the "/b" binary flag:
COPY /B HEADER + BATCH1 + BATCH2 + TAIL BIG_ONE
The destination file BIG_ONE has the four source files concatenated
together in order: HEADER, BATCH1, BATCH2, TAIL.
If you have a series of files named SAMPLE.001,
SAMPLE.002, SAMPLE.003, etc., a single wildcard command joins them
all:
COPY /B SAMPLE.* BIG_ONE
Included among the MIR support files supplied with
TUTORIAL ONE is a 238,312 byte file SVP_TXT. This file contains
an English translation of some 17th century correspondence of the
French priest and reformer, Vincent de Paul, who founded the
"Congregation of the Mission". The Vincentians, generous as
always, have given permission to use the correspondence for an
example in demonstrating indexing and retrieval methods.
The file SVP_TXT is used here to introduce tools and
first steps in data analysis. The file's quarter megabyte size is
no problem. For example, the program A_BYTES classifies and counts
every one of the 238,312 bytes in only four seconds on a slow
machine (AT 80286 at 12 Megahertz).
The character survey is produced by the program A_BYTES
("analyze bytes")...
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage: a_bytes [ /L ] file_name[s]
Analyze the bytes (characters) used within any file, report
the frequency of each byte present. If the location flag
/L is set, include offsets of the first 8 occurrences of
each byte pattern present.
input: Any file[s] whatsoever.
output: file_name.BYT which contains up to 256 lines, one line for
each different byte present. The byte is shown first in
printable OR octal form, then the hexadecimal equivalent.
The third column is frequency. The fourth column shows
percentage of total occurrences within the file.
If the /L locations option is selected, the output file is
named file_name.LOC and the offsets of the first up to 8
occurrences follow at the end of each line.
writeup: MIR TUTORIAL ONE, topic 4
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
A_BYTES SVP_TXT
creates a report file SVP_TXT.BYT which is a byte analysis of the
entire file. Here is that report:
\012 [0A] 4117 1.7%
\015 [0D] 4117 1.7%
\032 [1A] 1 0.0%
\040 [20] 38883 16.3%
! [21] 173 0.1%
' [27] 87 0.0%
( [28] 38 0.0%
) [29] 38 0.0%
, [2C] 2754 1.2%
- [2D] 266 0.1%
. [2E] 1922 0.8%
0 [30] 258 0.1%
1 [31] 560 0.2%
2 [32] 168 0.1%
3 [33] 117 0.0%
4 [34] 485 0.2%
5 [35] 170 0.1%
6 [36] 771 0.3%
7 [37] 174 0.1%
8 [38] 198 0.1%
9 [39] 85 0.0%
: [3A] 87 0.0%
; [3B] 158 0.1%
< [3C] 1511 0.6%
= [3D] 935 0.4%
> [3E] 1511 0.6%
? [3F] 62 0.0%
@ [40] 935 0.4%
A [41] 658 0.3%
B [42] 225 0.1%
C [43] 403 0.2%
D [44] 962 0.4%
E [45] 1357 0.6%
F [46] 124 0.1%
G [47] 388 0.2%
H [48] 501 0.2%
I [49] 1715 0.7%
J [4A] 90 0.0%
K [4B] 5 0.0%
L [4C] 519 0.2%
M [4D] 1029 0.4%
N [4E] 466 0.2%
O [4F] 536 0.2%
P [50] 476 0.2%
Q [51] 9 0.0%
R [52] 234 0.1%
S [53] 534 0.2%
T [54] 2364 1.0%
U [55] 161 0.1%
V [56] 166 0.1%
W [57] 132 0.1%
X [58] 841 0.4%
Y [59] 106 0.0%
Z [5A] 2 0.0%
[ [5B] 67 0.0%
] [5D] 67 0.0%
^ [5E] 479 0.2%
a [61] 11934 5.0%
b [62] 2128 0.9%
c [63] 3465 1.5%
d [64] 6330 2.7%
e [65] 21212 8.9%
f [66] 3581 1.5%
g [67] 2899 1.2%
h [68] 9818 4.1%
i [69] 11036 4.6%
j [6A] 121 0.1%
k [6B] 897 0.4%
l [6C] 5929 2.5%
m [6D] 4209 1.8%
n [6E] 10777 4.5%
o [6F] 15224 6.4%
p [70] 2305 1.0%
q [71] 96 0.0%
r [72] 10071 4.2%
s [73] 10261 4.3%
t [74] 15670 6.6%
u [75] 5655 2.4%
v [76] 1945 0.8%
w [77] 3214 1.3%
x [78] 273 0.1%
y [79] 3891 1.6%
z [7A] 76 0.0%
| [7C] 28 0.0%
é [82] 43 0.0%
â [83] 1 0.0%
à [85] 1 0.0%
ç [87] 7 0.0%
è [8A] 4 0.0%
î [8C] 4 0.0%
ô [93] 10 0.0%
The report (actually a single set of four columns) can
be anywhere up to 256 lines long, one line for each possible
arrangement among 8 off-and-on bits within one byte. The program
deliberately omits a heading line and shows only one byte pattern
per line. This enables us to get a clean result when we sort the
report:
SORT /+10 /R < SVP_TXT.BYT > SVP_TXT.BYS
This is the standard DOS SORT routine. The /+10 causes the sort to
start at the tenth column, that is, sort by the frequency. The /R
makes it a reverse frequency sort. The top end of the output looks
like this:
\040 [20] 38883 16.3%
e [65] 21212 8.9%
t [74] 15670 6.6%
o [6F] 15224 6.4%
a [61] 11934 5.0%
i [69] 11036 4.6%
n [6E] 10777 4.5%
If we include the "locations" flag when analyzing
bytes, each line of the output contains more information.
Processing is only two thirds as fast. The command would be:
A_BYTES -L SVP_TXT
The top end of the resulting SVP_TXT.LOC (note the changed report
name) looks like this:
\012 [0A] 4117 1.7% 31 56 120 154 178 244 309 374
\015 [0D] 4117 1.7% 30 55 119 153 177 243 308 373
\032 [1A] 1 0.0% 238311
\040 [20] 38883 16.3% 6 8 14 22 25 38 40 63
! [21] 173 0.1% 2098 2477 2800 3527 3671 6419 8389 8395
The report including locations may be sorted in precisely the same
manner as the BYT report:
SORT /+10 /R < SVP_TXT.LOC > SVP_TXT.LOS
If a file already exists with the target name, A_BYTES
replaces the last character with a digit. BYT becomes BY0 or BY1,
etc., up to BY9. The variations on LOC are LO0 through LO9.
One final note: A_BYTES can be run on a series of files
with a single command:
A_BYTES -L MYFILE HERFILE FILE.WHO WHATEVER
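For readers curious about the mechanics, the counting loop at the
heart of a byte survey is short. The sketch below is an
illustration of the technique in C, not the A_BYTES source; it
prints only the hexadecimal value, frequency and percentage, and
writes to the screen rather than to a .BYT file:
/* survey.c - illustrative core of a byte frequency survey. */
#include <stdio.h>

int main(int argc, char *argv[])
{
    unsigned long count[256] = {0};
    unsigned long total = 0;
    int c, i;
    FILE *fp;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: survey file_name\n");
        return 1;
    }
    /* Binary mode, so CR, LF and CTL-Z are counted like any byte. */
    while ((c = fgetc(fp)) != EOF) {
        count[c]++;
        total++;
    }
    fclose(fp);

    /* One line per byte value present: hex, frequency, percent. */
    for (i = 0; i < 256; i++)
        if (count[i] > 0)
            printf("[%02X] %8lu %5.1f%%\n",
                   i, count[i], 100.0 * count[i] / total);
    return 0;
}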
══════════════════════
4.8 Data types
══════════════════════
With a byte survey in hand, we can begin to answer two
questions about a file:
» What is in it?
» How is it presented?
The two questions are answered simultaneously during the analysis.
In the next section, we will look at how data is presented. Let's
focus on the first question. A file may contain one (or possibly
more) of these data types:
ASCII text / extended ASCII text
text with ASCII markup codes
text with binary markup codes
text with BCD (binary coded decimal) packed numbers
text with BASIC packed numbers
text with compression substitutions
EBCDIC
binary (compression, encryption, GIF graphic
interchange files, sound, etc.)
ASCII (American Standard Code for Information
Interchange) is an agreed-upon assignment of bit patterns to
letters, digits, punctuation, control characters, etc. ASCII text
consists of the letters of the printable alphabet, numeric digits,
punctuation characters and space (hex 20 through 7E) plus tab,
vertical tab and new page (hex 09, 0B and 0C respectively). The
newline character (hex 0A) is normally present (except in line
records and some fixed length records... see below). If the file
has been processed using a PC, each newline character is normally
immediately preceded by exactly one carriage return (hex 0D). PC
ASCII text files often have exactly one hex 1A or CTL-Z as the
last character (EOF or end of file marker).
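These conventions can be checked directly when the data's history
is uncertain. The following rough sketch in C (an illustration,
not one of the MIR tools) counts linefeeds that lack a preceding
carriage return and reports whether the last byte is CTL-Z:
/* textchk.c - illustrative check of PC ASCII text conventions. */
#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *fp;
    int c, prev = -1;
    unsigned long bare_lf = 0, ctlz = 0, bytes = 0;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: textchk file_name\n");
        return 1;
    }
    while ((c = fgetc(fp)) != EOF) {
        if (c == 0x0A && prev != 0x0D)   /* LF without CR before it */
            bare_lf++;
        if (c == 0x1A)                   /* CTL-Z anywhere in file  */
            ctlz++;
        prev = c;
        bytes++;
    }
    fclose(fp);
    printf("%lu bytes, %lu bare linefeeds, %lu CTL-Z bytes\n",
           bytes, bare_lf, ctlz);
    printf("last byte %s CTL-Z\n", prev == 0x1A ? "is" : "is not");
    return 0;
}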
Extended ASCII text for our purposes is ASCII text in
which accented characters appear within French, German, Spanish and
other foreign language words. If we allow also for punctuation
common in Spanish, the extra characters are in the ranges hex 80 to
9A and A0 to A8. These are all "high-bit-set" characters, specific
to PC compatible ASCII data. ("High-bit-set" means that the
leading bit is turned on; hence the value is hex 80 or higher.) The
count of these characters when they are part of extended ASCII text
is generally lower than the count of regular vowels in the same
distribution. Greek and mathematical characters (hex AB, AC, E0
to FD) may also occur in extended ASCII text. We will look later
at how to check context in order to verify whether high-bit-set
characters are valid text, errors within text or indications of a
different file type.
Text with ASCII markup codes uses the same characters
as in ASCII or extended ASCII text. It is common practice to
insert special bytes or series of bytes within data to signal how
the data should be displayed. This practice is called markup.
Examples of items marked:
justify text (left, center, right)
select font (courier, etc.), font subsets (italics)
select type size (pica, elite, with a count)
change to/from bold
underline
protect against "widows" and "orphans"
(ensuring paragraphs stay together)
respond to new heading level
The first indicator of ASCII markup is that some characters (often
"<" and ">") are present in unusually high proportion.
Text with binary markup codes may contain virtually all
256 characters. Null bytes (hex 00) do not occur in ASCII text,
but are common in files with binary content. If you have Microsoft
Word or WordPerfect files handy, try the command A_BYTES on one of
them and look at the listing that results. Notice that lower case
alphabetic letters still figure prominently in the distribution.
Text with packed numbers is found most often in COBOL
style fixed length fielded records. Distribution is like that of
ASCII text except that a sprinkling of other values shows up.
Binary Coded Decimal (BCD) shows up in COBOL records. BASIC
language data files have their own variation of packed numbers.
Text with compression substitutions has binary series
in the midst of normal text. This is notoriously difficult to work
with unless you have access to the decompression table.
EBCDIC data warrants special treatment the moment it is
identified. Work stations or mini computers that receive data on
nine track tape often deal with non-ASCII data. (The problem is
rare with personal computers.) If there are many '@' symbols and
no recognizable text, it is possible that you are working with
EBCDIC and not ASCII. (The reason for lots of '@' characters is
that the EBCDIC space character is the same as the ASCII '@'.)
Try the EBC_ASC program on a portion of the data:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage: ebc_asc ebcdic_input ascii_output
Converts an EBCDIC file to ASCII. EBCDIC (Extended Binary
Coded Decimal Interchange Code) data is commonly produced
by IBM mainframe computers. ASCII (American Standard Code
for Information Interchange) is used on personal computers
and computers produced by the majority of manufacturers.
input: Any EBCDIC file
output: ASCII equivalent
writeup: MIR TUTORIAL ONE, topic 4
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
If samples start to make sense when filtered through an
EBCDIC to ASCII conversion, process all samples through EBC_ASC
before continuing analysis. Later you will need to process all the
data through the same conversion.
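The heart of such a conversion is a 256-entry translation table.
The sketch below shows the approach in C with only the most common
EBCDIC code points filled in; it is an illustration, not the
EBC_ASC source. A working converter fills the entire table and,
under DOS, opens named files in binary mode instead of using the
standard streams:
/* ebc2asc.c - illustrative table-driven EBCDIC to ASCII filter. */
#include <stdio.h>

static unsigned char tbl[256];

static void build_table(void)
{
    int i;
    for (i = 0; i < 256; i++)
        tbl[i] = '?';                     /* unknown code points      */
    tbl[0x40] = ' ';                      /* EBCDIC space             */
    tbl[0x4B] = '.';
    tbl[0x6B] = ',';
    for (i = 0; i < 9; i++) {
        tbl[0xC1 + i] = 'A' + i;          /* A - I                    */
        tbl[0xD1 + i] = 'J' + i;          /* J - R                    */
        tbl[0x81 + i] = 'a' + i;          /* a - i                    */
        tbl[0x91 + i] = 'j' + i;          /* j - r                    */
    }
    for (i = 0; i < 8; i++) {
        tbl[0xE2 + i] = 'S' + i;          /* S - Z                    */
        tbl[0xA2 + i] = 's' + i;          /* s - z                    */
    }
    for (i = 0; i < 10; i++)
        tbl[0xF0 + i] = '0' + i;          /* digits 0 - 9             */
}

int main(void)
{
    int c;
    build_table();
    while ((c = getchar()) != EOF)        /* EBCDIC in, ASCII out     */
        putchar(tbl[c]);
    return 0;
}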
Binary data: Suppose you find in every sample that the
data looks completely meaningless. There are many reasons that
data dumps appear to be a jumble. Among them are:
binary compression
encryption
binary numeric data
graphics content
sound files
non-ASCII data
Attempts to decipher binary data are time consuming and expensive.
The process is called reverse engineering. It is not always legal.
Go back to asking people questions (the people providing the data,
and maybe a lawyer) before getting too deep!
═════════════════════════════
4.9 Data presentation
═════════════════════════════
A byte survey is also used as part of the analysis of
how the data is presented. Main options are:
byte stream
line records
fixed length records
blocked records with ASCII lengths
blocked records with binary lengths
Here we simply describe the options. The topic on deblocking data
will describe the programs used. Deblocking, like EBCDIC to ASCII
conversion, may have to be carried out before analysis can be
completed.
One stage of the indexing process will be to create a
byte stream equivalent of the file to be indexed. A byte stream is
the crudest form of file... simply a series of bytes in the order
in which they would be displayed if sent to a screen. The ASCII
source code file for each MIR program is a byte stream.
Line records are blocks of text or other data, padded
with blanks out to some fixed length, very often 80. This form of
data storage goes back to punch cards (remember them?) in which as
many English (or other natural language) words as possible were
fitted into one 80 column card. When the next word would cause
overflow, it was placed at the beginning of the next card. A
telltale sign of line records is a disproportionately high number
of blanks and few if any linefeeds in the byte distribution.
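Turning line records back into ordinary text is largely a matter
of trimming the blank padding and supplying line ends, as in this
rough C sketch. The 80-byte record length is an assumption to
adjust for the data at hand; the deblocking topic describes the
actual MIR programs:
/* unpad.c - illustrative conversion of 80-byte line records to
   newline-terminated text. */
#include <stdio.h>

#define RECLEN 80

int main(int argc, char *argv[])
{
    FILE *in, *out;
    unsigned char rec[RECLEN];
    size_t n;
    int len;

    if (argc != 3 || (in = fopen(argv[1], "rb")) == NULL ||
                     (out = fopen(argv[2], "wb")) == NULL) {
        fprintf(stderr, "usage: unpad input_file output_file\n");
        return 1;
    }
    while ((n = fread(rec, 1, RECLEN, in)) > 0) {
        len = (int)n;
        while (len > 0 && rec[len - 1] == ' ')   /* drop the padding */
            len--;
        fwrite(rec, 1, (size_t)len, out);
        fputc('\r', out);                        /* DOS line ending  */
        fputc('\n', out);
    }
    fclose(in);
    fclose(out);
    return 0;
}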
Fixed length records also date back to punch cards.
One or more cards would be divided into groups of columns, with one
group assigned to each field. Sizes of fields are fixed according
to the amount of data that the file designer expects... perhaps
eight columns/bytes for a purchase order number, 30 for a street
address, 10 for a quantity, 55 for an item description, etc.
Unused fields and unused spaces within fields are left blank. Text
fields are normally left justified (extra spaces at the end), and
numeric fields are right justified (leading bytes either blank or
zero padded). Line feeds are rare, and would occur only within a
long text field. The essence of fixed length records is that
location determines meaning. Like line records, fixed length
records show a very high frequency of blanks (hex 20). They are
likely to have more numeric digits than line records. Alternatively,
if the records contain Binary Coded Decimal packed numbers, there
will be a near random assortment of high-bit-set bytes.
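The principle that location determines meaning can be captured in
a small table of offsets and widths. The C sketch below uses the
hypothetical layout mentioned above (8-byte purchase order number,
30-byte street address, 10-byte quantity, 55-byte item
description) and a made-up file name; it illustrates the idea, not
any real file format:
/* fields.c - illustrative field extraction from fixed length
   records; the layout and file name are hypothetical. */
#include <stdio.h>
#include <string.h>

#define RECLEN 103                  /* 8 + 30 + 10 + 55 */

struct layout { int offset, width; const char *name; };

static const struct layout fld[] = {
    {  0,  8, "po_number"   },
    {  8, 30, "street"      },
    { 38, 10, "quantity"    },
    { 48, 55, "description" },
};

int main(void)
{
    char rec[RECLEN], buf[56];
    int i, nfld = (int)(sizeof fld / sizeof fld[0]);
    FILE *fp = fopen("ORDERS.DAT", "rb");    /* hypothetical name */

    if (fp == NULL)
        return 1;
    while (fread(rec, 1, RECLEN, fp) == RECLEN) {
        for (i = 0; i < nfld; i++) {
            memcpy(buf, rec + fld[i].offset, fld[i].width);
            buf[fld[i].width] = '\0';
            printf("%-12s '%s'\n", fld[i].name, buf);
        }
        printf("\n");
    }
    fclose(fp);
    return 0;
}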
Blocked records with ASCII lengths are series of byte
streams of variable length, with a measure of length at the front
of each block. This length is typically 4 digits with zero padding
(for example, 0032 or 3217 or 0539). The byte survey of this data
shows high frequencies of digits, especially the digit zero. Line
feeds show up less often than normal, or are non-existent.
Otherwise, the content is fairly typical of ASCII text.
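A first pass over such data takes only a few lines of C. The
sketch below assumes the 4-digit count covers just the data that
follows it (whether the count also includes the length field
itself is something to verify against the actual file); it is an
illustration, not one of the MIR deblocking programs described in
a later topic:
/* deblock.c - illustrative reader of records prefixed with a
   4-digit ASCII length, written out one block per line. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    FILE *in, *out;
    char lenbuf[5];
    static char block[10000];
    long len;

    if (argc != 3 || (in = fopen(argv[1], "rb")) == NULL ||
                     (out = fopen(argv[2], "wb")) == NULL) {
        fprintf(stderr, "usage: deblock input_file output_file\n");
        return 1;
    }
    while (fread(lenbuf, 1, 4, in) == 4) {
        lenbuf[4] = '\0';
        len = atol(lenbuf);                  /* "0539" -> 539        */
        if (len <= 0 || len > (long)sizeof(block))
            break;                           /* implausible length   */
        if (fread(block, 1, (size_t)len, in) != (size_t)len)
            break;                           /* truncated last block */
        fwrite(block, 1, (size_t)len, out);
        fputc('\n', out);                    /* one block per line   */
    }
    fclose(in);
    fclose(out);
    return 0;
}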
Blocked records with binary lengths have been common in
library data (so called "MARC" records) and in the publishing
business. They are often more sophisticated than blocked records
with ASCII lengths; there may be field sub-lengths within larger
blocks. Byte surveys show a small percentage of randomly
distributed binary characters. Those that have the high bit set
are noticeable. High-bit-off binary bytes are hidden in the
frequencies of normal text characters.
══════════════════════════════
4.10 Byte distributions
══════════════════════════════
Here is the distribution of alphabetical characters in
a sample of English text (actually drafts of seven MIR topics).
The 79,657 characters were as follows:
A 312
B 101
C 288
D 220
E 288
F 152
G 108
H 112
I 406
J 7
K 17
L 175
M 190
N 187
O 292
P 219
Q 17
R 281
S 331
T 447
U 109
V 38
W 117
X 15
Y 81
Z 5
a 6018
b 920
c 2764
d 2775
e 9658
f 1766
g 1482
h 2959
i 5478
j 49
k 378
l 2523
m 1996
n 5194
o 6008
p 1899
q 126
r 5433
s 4975
t 6966
u 2189
v 834
w 1010
x 337
y 1319
z 86
Here is the same data again, arranged from most
frequent to least frequent:
e 9658
t 6966
a 6018
o 6008
i 5478
r 5433
n 5194
s 4975
h 2959
d 2775
c 2764
l 2523
u 2189
m 1996
p 1899
f 1766
g 1482
y 1319
w 1010
b 920
v 834
T 447
I 406
k 378
x 337
S 331
A 312
O 292
E 288
C 288
R 281
D 220
P 219
M 190
N 187
L 175
F 152
q 126
W 117
H 112
U 109
G 108
B 101
z 86
Y 81
j 49
V 38
Q 17
K 17
X 15
J 7
Z 5
Frequency patterns in English text vary with the
subject matter, but never to a large degree. Letters j and z are
little used; letters e and t together with the space character
typically account for between one quarter and one third of all
bytes present (in SVP_TXT they total 31.8%). The proportion drops
when markup, digits and punctuation dilute the text. Spaces
(blanks) lead the frequency distribution in virtually all text
files.
European languages use the high-bit-set accented
characters. In French, the frequency of unaccented letter e drops
somewhat in favor of è with grave accent (infrequent) and é with
acute accent (common). Spanish uses more of the letter l, the
tilde form of ñ and Ñ, and of course the interrogative and
exclamatory symbols ¿ and ¡ at the beginning of sentences. The PC
high-bit-set accented characters apply as well in Scandinavian
languages, German, and so forth. But no language based on the
roman character set departs altogether from the basic patterns --
frequent use of vowels, certain consonants highly favored and
others marginal across all languages.
Treat each departure from typical distributions as a
prompt to analyze each exceptional byte in its contexts.
To illustrate, let's examine what significance we can
attach to the example byte survey of the file SVP_TXT above. Start
by making your own complete copy with the command:
A_BYTES SVP_TXT
What further analysis should we do?
» The lower case letters have a typical English text
distribution... j, z and q infrequent; e, a, t and o
with very high frequency.
» Upper case letters also show a distribution typical of
English text... Q and Z lowest, K very low (rarely the
first letter in a sentence), M appearing together with
the normally high vowels and T. M is frequent at the
beginning of names, and in French greetings (Monsieur,
Madame).
» There are only 70 characters out of 238,312 that have
the high bit set (hex 80 or higher) and all can be
verified to be valid accented French characters that
might appear in names of people and places in French
correspondence.
» The line feed (hex 0A) and carriage return (hex 0D)
each appear 4117 times, normal in an ASCII text file
with 4117 lines. The locations report shows that the
first eight occurrences of 0D and 0A are paired right
together, carriage return first.
» The only other non-printing characters are hex 20
(space) and hex 1A (CTL-Z or end of file in DOS). At
16.3% of occurrences, the space is the most frequent
character. The CTL-Z occurs once only, in the very
last byte of the file. Byte number 238311 is the last.
(Offsets count from zero upward and SVP_TXT is a 238312
byte file.) CTL-Z is the standard end of file marker
for ASCII text files.
» We therefore have cumulative evidence that this is an
extended ASCII file. We have not yet established
whether it contains markup.
» Round and square parentheses probably are matched sets
... 38 of ( and ), 67 of [ and ]. The latter are worth
checking to see whether they occur in normal ways within
text.
» The sort by frequency reveals that < and > each appear
1511 times... a tip that they might be used as matching
angle brackets. The location data shows that the first
eight are indeed paired only two or three bytes apart.
Together the < and > bytes comprise 1.2% of the file...
abnormally high for correspondence, and therefore
likely part of some sort of markup code.
» The sort by frequency also shows @ and = occurring 935
times each. Are they matched also and part of a code?
The first eight appearances are within 7 bytes of each
other, with the @ symbol coming first each time.
» We should investigate why ^ shows up 479 times and | is
present 28 times. Neither is normal for correspondence.
The next topic explains how to examine bytes in context
and how to survey patterns across an entire file. Following this
are topics on analysis of detail within each of the data types.